Efficient Wait-k Models for Simultaneous Machine Translation
Simultaneous machine translation consists of starting output generation
before the entire input sequence is available. Wait-k decoders offer a simple
but efficient approach to this problem. They first read k source tokens, after
which they alternate between producing a target token and reading another
source token. We investigate the behavior of wait-k decoding in low resource
settings for spoken corpora using IWSLT datasets. We improve training of these
models using unidirectional encoders, and training across multiple values of k.
Experiments with Transformer and 2D-convolutional architectures show that our
wait-k models generalize well across a wide range of latency levels. We also
show that the 2D-convolution architecture is competitive with Transformers for
simultaneous translation of spoken language.
Comment: Accepted at INTERSPEECH 2020
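To make the read/write schedule concrete, below is a minimal Python sketch of the wait-k loop. `source_stream` and `predict_token` are hypothetical stand-ins for a real streaming input and an incremental NMT decoder, so this illustrates the schedule, not the paper's implementation.

```python
# Minimal sketch of the wait-k schedule: read k source tokens, then
# alternate between writing one target token and reading one source token.
# `source_stream` and `predict_token` are hypothetical placeholders.

def wait_k_decode(source_stream, predict_token, k, eos="</s>"):
    source, target = [], []
    for _ in range(k):                       # initial read phase
        tok = next(source_stream, None)
        if tok is None:
            break
        source.append(tok)
    while True:                              # alternate write / read
        y = predict_token(source, target)
        target.append(y)
        if y == eos:
            return target
        tok = next(source_stream, None)
        if tok is not None:                  # source may already be exhausted
            source.append(tok)

# Toy usage: a "translator" that copies the source under the wait-k delay.
src = iter("wir sehen uns morgen".split())
copy = lambda s, t: s[len(t)] if len(t) < len(s) else "</s>"
print(wait_k_decode(src, copy, k=2))  # ['wir', 'sehen', 'uns', 'morgen', '</s>']
```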
Pervasive Attention: 2D Convolutional Neural Networks for Sequence-to-Sequence Prediction
Current state-of-the-art machine translation systems are based on encoder-decoder architectures that first encode the input sequence, and then generate an output sequence based on the input encoding. Both are interfaced with an attention mechanism that recombines a fixed encoding of the source tokens based on the decoder state. We propose an alternative approach which instead relies on a single 2D convolutional neural network across both sequences. Each layer of our network re-codes source tokens on the basis of the output sequence produced so far. Attention-like properties are therefore pervasive throughout the network. Our model yields excellent results, outperforming state-of-the-art encoder-decoder systems, while being conceptually simpler and having fewer parameters.
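As a toy illustration of the idea (shapes and layer choices are assumptions, not the paper's exact architecture), the PyTorch sketch below builds the (target x source) grid of concatenated embeddings, applies one 2D convolution that is causal along the target axis, and max-pools over the source axis to obtain per-target-position features.

```python
# Toy sketch of "pervasive attention": a 2D CNN over the (target, source)
# grid of concatenated embeddings. All sizes here are illustrative.
import torch
import torch.nn as nn

B, Ts, Tt, d = 2, 7, 5, 16                     # batch, src len, tgt len, emb dim
src = torch.randn(B, Ts, d)                    # source token embeddings
tgt = torch.randn(B, Tt, d)                    # target token embeddings

# Build the 2D grid: cell (i, j) holds the concatenation [tgt_i ; src_j].
grid = torch.cat(
    [tgt.unsqueeze(2).expand(B, Tt, Ts, d),
     src.unsqueeze(1).expand(B, Tt, Ts, d)],
    dim=-1,
).permute(0, 3, 1, 2)                          # -> (B, 2d, Tt, Ts)

# A masked 3x3 conv: pad only on the "past" side of the target axis so a
# cell never sees future target rows (causality along dim 2).
conv = nn.Conv2d(2 * d, d, kernel_size=3, padding=0)
padded = nn.functional.pad(grid, (1, 1, 2, 0))  # (src left/right, tgt past only)
h = conv(padded)                                # -> (B, d, Tt, Ts)

# Max-pool over the source axis: one feature vector per target position,
# playing the role of attention when predicting the next target token.
features = h.max(dim=3).values                  # -> (B, d, Tt)
print(features.shape)
```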
Towards Being Parameter-Efficient: A Stratified Sparsely Activated Transformer with Dynamic Capacity
Mixture-of-experts (MoE) models that employ sparse activation have
demonstrated effectiveness in significantly increasing the number of parameters
while maintaining low computational requirements per token. However, recent
studies have established that MoE models are inherently parameter-inefficient
as the improvement in performance diminishes with an increasing number of
experts. We hypothesize this parameter inefficiency is a result of all experts
having equal capacity, which may not adequately meet the varying complexity
requirements of different tokens or tasks. In light of this, we propose
Stratified Mixture of Experts (SMoE) models, which feature a stratified
structure and can assign dynamic capacity to different tokens. We demonstrate
the effectiveness of SMoE on three multilingual machine translation benchmarks,
containing 4, 15, and 94 language pairs, respectively. We show that SMoE
outperforms multiple state-of-the-art MoE models with the same or fewer
parameters.
Comment: Accepted at Findings of EMNLP 2023
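A hedged sketch of the stratified-capacity idea follows: experts whose feed-forward widths differ, with a learned top-1 router sending each token to one expert. The widths and routing rule here are illustrative assumptions, not the paper's configuration.

```python
# Sketch of a stratified MoE layer: experts with increasing capacity and a
# per-token router. Widths, gating, and dispatch are illustrative only.
import torch
import torch.nn as nn

class StratifiedMoE(nn.Module):
    def __init__(self, d_model=16, widths=(8, 16, 32, 64)):
        super().__init__()
        # "Strata": experts with increasing FFN hidden width.
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, w), nn.ReLU(), nn.Linear(w, d_model))
            for w in widths
        )
        self.router = nn.Linear(d_model, len(widths))   # per-token gate

    def forward(self, x):                   # x: (num_tokens, d_model)
        gates = self.router(x).softmax(-1)  # routing probabilities
        idx = gates.argmax(-1)              # top-1 expert per token
        out = torch.zeros_like(x)
        for e, expert in enumerate(self.experts):
            sel = idx == e
            if sel.any():                   # dispatch tokens to their expert
                out[sel] = expert(x[sel]) * gates[sel, e].unsqueeze(-1)
        return out

tokens = torch.randn(10, 16)
print(StratifiedMoE()(tokens).shape)        # -> torch.Size([10, 16])
```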
Added Toxicity Mitigation at Inference Time for Multimodal and Massively Multilingual Translation
Added toxicity in the context of translation refers to producing a
translation output that contains more toxicity than the input does. In this
paper, we present MinTox, a novel pipeline that identifies added toxicity and
mitigates it at inference time. MinTox uses a multimodal (speech and text)
toxicity detection classifier that works across languages at scale, and the
mitigation method is applied at the same scale, directly on the text outputs.
MinTox is applied to SEAMLESSM4T, the latest multimodal and massively
multilingual machine translation system. For this system, MinTox achieves
significant added toxicity mitigation across domains, modalities and language
directions, filtering out approximately 25% to 95% of added toxicity
(depending on the modality and domain) while keeping translation quality
unchanged.
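A toy sketch of the inference-time check described above: flag toxic terms that appear in the hypothesis but not in the source, and re-decode with those terms banned. The wordlist detector and the `retranslate_with_ban` hook are placeholders; MinTox itself relies on a multimodal, massively multilingual classifier.

```python
# Toy added-toxicity filter. TOXIC and retranslate_with_ban are stand-ins
# for MinTox's real multilingual classifier and constrained re-decoding.
TOXIC = {"idiot", "stupid"}                     # placeholder toxicity lexicon

def toxic_terms(text):
    return {w for w in text.lower().split() if w in TOXIC}

def mintox_style_filter(source, hypothesis, retranslate_with_ban):
    # Only *added* toxicity triggers mitigation: toxicity already present
    # in the source is considered faithful translation.
    added = toxic_terms(hypothesis) - toxic_terms(source)
    if not added:
        return hypothesis
    # Mitigate: re-decode while banning the offending terms.
    return retranslate_with_ban(source, banned=added)

# Usage with a placeholder re-decoder:
src = "you are right"
hyp = "you are stupid"
print(mintox_style_filter(src, hyp, lambda s, banned: "you are correct"))
```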
Depth-adaptive Transformer
State-of-the-art sequence-to-sequence models for large-scale tasks perform a fixed number of computations for each input sequence, regardless of whether it is easy or hard to process. In this paper, we train Transformer models which can make output predictions at different stages of the network, and we investigate different ways to predict how much computation is required for a particular sequence. Unlike dynamic computation in Universal Transformers, which applies the same set of layers iteratively, we apply different layers at every step to adjust both the amount of computation and the model capacity. On IWSLT German-English translation, our approach matches the accuracy of a well-tuned baseline Transformer while using less than a quarter of the decoder layers.
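A minimal sketch of the early-exit mechanic, assuming a per-layer output classifier and a max-probability confidence rule (both illustrative); plain linear layers stand in for Transformer blocks.

```python
# Sketch of depth-adaptive decoding: attach a classifier head to every
# layer and stop as soon as the prediction is confident enough.
import torch
import torch.nn as nn

d_model, vocab, n_layers = 16, 100, 6
layers = nn.ModuleList(nn.Linear(d_model, d_model) for _ in range(n_layers))
heads = nn.ModuleList(nn.Linear(d_model, vocab) for _ in range(n_layers))

def adaptive_forward(h, threshold=0.5):
    """Run layers until some head's max softmax probability beats threshold."""
    for depth, (layer, head) in enumerate(zip(layers, heads), start=1):
        h = torch.relu(layer(h))              # stand-in for a Transformer block
        probs = head(h).softmax(-1)
        conf, token = probs.max(-1)
        if conf.item() >= threshold:          # confident enough: exit early
            return token.item(), depth
    return token.item(), n_layers             # fall through to the top layer

tok, depth = adaptive_forward(torch.randn(d_model))
print(f"predicted token {tok} after {depth} layers")
```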
Online Versus Offline NMT Quality: An In-depth Analysis on English-German and German-English
In this work, we conduct an evaluation study comparing offline and online
neural machine translation architectures. Two sequence-to-sequence models are
considered: the convolutional Pervasive Attention model (Elbayad et al., 2018)
and the attention-based Transformer (Vaswani et al., 2017). We investigate, for both
architectures, the impact of online decoding constraints on the translation
quality through a carefully designed human evaluation on English-German and
German-English language pairs, the latter being particularly sensitive to
latency constraints. The evaluation results allow us to identify the strengths
and shortcomings of each model when we shift to the online setup.
Comment: Accepted at COLING 2020
ON-TRAC Consortium for End-to-End and Simultaneous Speech Translation Challenge Tasks at IWSLT 2020
This paper describes the ON-TRAC Consortium translation systems developed for
two challenge tracks featured in the Evaluation Campaign of IWSLT 2020, offline
speech translation and simultaneous speech translation. ON-TRAC Consortium is
composed of researchers from three French academic laboratories: LIA (Avignon
Université), LIG (Université Grenoble Alpes), and LIUM (Le Mans
Université). Attention-based encoder-decoder models, trained end-to-end, were
used for our submissions to the offline speech translation track. Our
contributions focused on data augmentation and ensembling of multiple models.
In the simultaneous speech translation track, we build on Transformer-based
wait-k models for the text-to-text subtask. For speech-to-text simultaneous
translation, we attach a wait-k MT system to a hybrid ASR system. We propose an
algorithm to control the latency of the ASR+MT cascade and achieve a good
latency-quality trade-off on both subtasks.
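As a rough illustration only (the paper's latency-control algorithm is more involved), the sketch below couples a placeholder streaming ASR with a wait-k MT step: translation starts once the committed transcript is at least k tokens ahead of the decoder.

```python
# Sketch of an ASR+MT cascade in the spirit described above. `stabilize`
# and `mt_step` are hypothetical placeholders for a streaming ASR that
# commits transcript tokens and an incremental wait-k MT decoder.
def cascade(asr_chunks, stabilize, mt_step, k):
    source, target = [], []
    for chunk in asr_chunks:                 # incoming audio chunks
        source.extend(stabilize(chunk))      # tokens the ASR has committed to
        # Wait-k schedule over the committed transcript: write while the
        # decoder lags the source by at least k tokens.
        while len(source) - len(target) >= k:
            target.append(mt_step(source, target))
    # Final flush once the ASR stream ends (real target lengths differ;
    # this stopping rule is a simplification for the sketch).
    while len(target) < len(source):
        target.append(mt_step(source, target))
    return target
```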